Intro to NGS processing

James A. Fellows Yates

2021-08-17

Who am I?

  • Education
    • B.Sc. Bioarchaeology (University of York, UK)
    • M.Sc. Naturwissenschaftliches Archäologie (University of Tübingen, DE)
    • Ph.D. Archaeogenetics (MPI-SHH / MPI-EVA, DE)
  • Experience
    • Number of genetics classes taken: 0
    • Number of bioinformatics classes taken: 0

@jfy133

Today we will

  1. Introduce what DNA sequencing is
  2. Explain how Illumina NGS sequencing data is generated
  3. Show how to evaluate NGS data [Practical]

Introduction to DNA

What is DNA?

Deoxyribonucleic acid (/diːˈɒksɪˌraɪboʊnjuːˌkliːɪk, -ˌkleɪ-/; DNA) is a molecule composed of two polynucleotide chains that coil around each other to form a double helix carrying genetic instructions for the development, functioning, growth and reproduction of all known organisms and many viruses. - Wikipedia

What is DNA?

DNA structure

The rules

  • Four nucleotides
    • Pyrimidines: Cytosine, Thymine
    • Purines: Guanine, Adenine
  • Base pairing: one pyrimidine with one purine
    • C with G (think: CGI)
    • A with T (think: AT-AT walker)
  • Complementary
    • C on one strand, G on the other (or vice versa)
    • A on one strand, T on the other (or vice versa)
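
The pairing rules above can be written as a small lookup table. This is only a minimal sketch: the names PAIRS and complement are made up for illustration and do not come from any particular library.

```python
# Base-pairing rules from the slide: C pairs with G, A pairs with T.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the complementary strand, base by base."""
    return "".join(PAIRS[base] for base in strand.upper())


if __name__ == "__main__":
    print(complement("AATGATACGG"))  # TTACTATGCC
```

Applying complement twice gives back the original strand, which is what 'complementary' means here.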

AT-AT Walker

The rules

  • Make a copy of a DNA strand with a polymerase
    • Unwind the DNA
    • Separate the strands
    • Make new strand: find a C, get new G (etc)

DNA replication split

How do we get DNA?

Figure 17 01 02

Introduction to DNA Sequencing

What is Sequencing?

Converting the chemical nucleotides of a DNA molecule

to

ACTG on your computer screen

Historically

  • Sanger sequencing

Sanger-sequencing

  • Separate strands, add primer (starting point)
  • Add mix of nucleotides, some with special ‘terminators’
  • Pass through size-filtering, read order of terminators
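
The three steps above can be mimicked with a toy simulation. It is a rough sketch under simplifying assumptions: every copy starts at the same primer, each position has the same chance of picking up a terminator, and the function names and parameters (sanger_fragments, read_sequence, p_terminate) are invented for this example.

```python
import random

PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_fragments(template, n_molecules=5000, p_terminate=0.2, seed=0):
    """Copy the template many times; each copy stops at the first
    position where a 'terminator' nucleotide happens to be incorporated.
    Returns the set of (fragment length, terminating base) pairs seen."""
    rng = random.Random(seed)
    fragments = set()
    for _ in range(n_molecules):
        for length, base in enumerate(template, start=1):
            if rng.random() < p_terminate:
                # The terminator is the complement of the template base.
                fragments.add((length, PAIRS[base]))
                break
    return fragments


def read_sequence(fragments):
    """Size-sort the fragments and read the terminator at each length."""
    return "".join(base for _, base in sorted(fragments))


if __name__ == "__main__":
    fragments = sanger_fragments("AATGATACGG")
    print(read_sequence(fragments))
```

With enough copies, every fragment length is represented, so reading the terminators in size order spells out the newly made strand (the complement of the template).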

Pros and cons of Sanger Sequencing

  • Pros
    • More accurate (fewer errors)
    • Longer reads
  • Cons
    • Resource heavy: lots of input DNA
    • Slow: one. fragment. at. a. time.

What is NGS?

  • NGS: Next Generation Sequencing
    • MASSIVELY multiplexed!
    • Sequence millions and even billions of DNA reads at once!

Not really ‘next’ anymore, consider it more ‘second’ generation (see: Nanopore)

What is NGS?

Market leader:

Illumina HiSeq 2500

(Others: Roche 454, PacBio, IonTorrent etc.)

How does it work?

  • Basically the same concept, but:
    • no size separation
    • with pretty pictures!

i.e. attach fluorophore-modified nucleotides, (normally) one colour per base

A

G

T

C

Fire mah lazer, and take a picture! Rinse and repeat!

Where does this happen?

On a ‘flow cell’

Next generation sequencing slide

Where does this happen?

But how do you get your DNA to attach to the lawn

(and not get lost)?

  • Convert it to a library:
    • Add adapters: bind to the ‘lawn’ of the flow cell
    • Add indexes: sample-specific barcode
    • Add priming sites: where enzymes start copying DNA

AATGATACGGCGACCACCACaccgacaaCCCTACACGACGCTCTTCCGATCTXXXXXXAGCACACGTCTGAACTCCAGTCACgacactaCCGTCTTCTGCTTG
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTACTATGCCGCTGGTGGTGtggctgttGGGATGTGCTGCGAGAAGGCTAGAXXXXXXTCGTGTGCAGACTTGAGGTCAGTGctgtgatGGCAGAAGACGAAC

[Adapter & Index Primer] [Index] [Target primer] [Target] [Target primer] [Index] [Adapter & Index Primer]
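
Because each index is a sample-specific barcode, reads from many samples can be pooled on one flow cell and sorted out afterwards. Below is a minimal sketch of that sorting step. The sample names, the second index sequence, and the function name demultiplex are hypothetical, and real tools also handle sequencing errors in the index, which this sketch ignores.

```python
# Hypothetical sample sheet: index (barcode) sequence -> sample name.
SAMPLE_INDEXES = {
    "ACCGACAA": "sample_A",
    "TTGGCATC": "sample_B",
}


def demultiplex(reads, sample_indexes=SAMPLE_INDEXES):
    """Group (index, sequence) pairs by sample.

    Reads whose index is not in the sample sheet are collected
    under 'undetermined' (exact matching only, no error tolerance).
    """
    by_sample = {}
    for index, sequence in reads:
        sample = sample_indexes.get(index.upper(), "undetermined")
        by_sample.setdefault(sample, []).append(sequence)
    return by_sample


if __name__ == "__main__":
    pooled = [
        ("ACCGACAA", "GATTACA"),
        ("TTGGCATC", "CCGTTAA"),
        ("accgacaa", "TTTGGGA"),
        ("GGGGGGGG", "ACACACA"),
    ]
    for sample, sequences in demultiplex(pooled).items():
        print(sample, sequences)
```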

Sequencing-by-synthesis

Add DNA to the flow cell, but there is a problem: the fluorescence of a single nucleotide is not enough…

Cluster Generation

Make lots of copies!

Sequencing-by-synthesis

Cluster Generation

  1. Add fluorescent nucleotides (complementary will bind)
  2. Fire laser & take photo
  3. Wash away unbound nucleotides
  4. Remove fluorophore
  5. Back to 1 ⤴️
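
The cycle above can be written as a loop, one base per cycle. This is an idealised sketch with no errors or signal decay. The colour assignment in COLOURS is an assumption made up for illustration, as is the function name sequence_by_synthesis.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hypothetical colour per base; real instruments use their own schemes.
COLOURS = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_COLOUR = {colour: base for base, colour in COLOURS.items()}


def sequence_by_synthesis(template: str, n_cycles: int) -> str:
    """Idealised sequencing-by-synthesis of one cluster."""
    read = []
    for cycle in range(min(n_cycles, len(template))):
        # 1. The complementary fluorescent nucleotide binds.
        incoming = PAIRS[template[cycle]]
        # 2. Fire laser and take photo: we only observe a colour.
        colour = COLOURS[incoming]
        # Base call: translate the observed colour back to a base.
        read.append(BASE_FOR_COLOUR[colour])
        # 3. and 4. Wash and remove the fluorophore: nothing is carried
        # over, so the next cycle starts clean (5. back to 1).
    return "".join(read)


if __name__ == "__main__":
    print(sequence_by_synthesis("AATGATACGG", n_cycles=6))  # TTACTA
```

The number of cycles sets the read length: six cycles give a six-base read, however long the template is.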

What does this look like?

Cluster Generation

Improving quality

One problem: over time, imaging becomes less reliable

Throughout limits

Paired end

Paired end sequencing

Sequence one end, bend the molecule over, attach the other end (turnaround), and start again from the other end of the molecule
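
In terms of the resulting data, paired-end sequencing gives two reads per molecule, one from each end. The second read comes off the opposite strand, so it appears as the reverse complement of the far end of the fragment. This is a minimal sketch with made-up function names and no sequencing errors.

```python
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Complement the strand and read it in the opposite direction."""
    return "".join(PAIRS[base] for base in reversed(strand.upper()))


def paired_end_reads(fragment: str, read_length: int):
    """Return (read 1, read 2) for one fragment.

    Read 1 starts at one end of the fragment; read 2 starts at the
    other end and is read from the opposite strand.
    """
    read_1 = fragment[:read_length]
    read_2 = reverse_complement(fragment)[:read_length]
    return read_1, read_2


if __name__ == "__main__":
    print(paired_end_reads("AATGATACGGCG", read_length=4))  # ('AATG', 'CGCC')
```

If the fragment is shorter than twice the read length, the two reads overlap in the middle.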

Cons of NGS sequencing

  • Less accurate (the laser/photo step can get a base wrong)
  • Chemistry limits (DNA strands get old through heat cycling for denaturing; dirty environment from suboptimal wash steps, etc.) mean short reads (compensated for by volume)